The purpose of this code is to perform exploratory data analysis on
the mortality data in the nc2016.csv data set that I prepared in the
data aggregation project. (Please see the 2016Mort_dataAggr.html file to
view the data aggregation project.)
My goals for this project are to check linearity, normality, analyze the
distribution of the data, discover any correlations and check for
collinearity and multicollinearity.
Before we begin the analyses here is a brief explanation of the
layout of this document and what each of the parts are.
This page displays code, output, and discussions about the output. The
layout is as follows:
1. Code is displayed in gray “code boxes”
# This is a code box
x <- c(1,3,2,4,8,4,7)
y <- c(9,3,1,5,7,2,6)
plot(x, y, col='#04918a', main="Output of code", type = "p", pch = 19)
nc2016 <- read.csv("/Users/tamara/Documents/DataProjects/Mortality/nc2016.csv")
nc2016
stargazer(nc2016, type = 'html',
keep = c("mentUnhDays", "perAccToExer", "perExcDrink", "Unemploy", "perUnemploy", "perInsSleep", "foodEnvIndx", "chlamyd", "perChlamyd", "tBirthRate", "violenCrime", "vioCrimeRate", "ageAdjRate"),
covariate.labels = c("Mentally Unhealthy Days", "Access to exercise opportunities", "Excessive drinking", "Unemployment", "Unemployment Rate", "Percent Insufficient Sleep", "Food environment index", "Chlamydia Cases", "Chlamydia Rate", "Teen Birth Rate", "Violent Crimes", "Violent Crime Rate", "Aged Adjusted Mortality Rate"),
title = "Descriptive Statistics")
| Statistic | N | Mean | St. Dev. | Min | Max |
| Mentally Unhealthy Days | 100 | 4.211 | 0.299 | 3.368 | 5.400 |
| Access to exercise opportunities | 100 | 68.250 | 18.950 | 19.330 | 100.000 |
| Excessive drinking | 100 | 16.050 | 1.894 | 12.130 | 22.620 |
| Unemployment | 100 | 2,464.000 | 3,931.000 | 128 | 27,415 |
| Unemployment Rate | 100 | 5.640 | 1.184 | 3.844 | 9.406 |
| Percent Insufficient Sleep | 100 | 33.660 | 2.771 | 28.600 | 39.570 |
| Food environment index | 98 | 7.202 | 0.870 | 4.100 | 8.500 |
| Chlamydia Cases | 100 | 643.800 | 1,271.000 | 10 | 9,276 |
| Chlamydia Rate | 100 | 0.005 | 0.003 | 0.001 | 0.012 |
| Teen Birth Rate | 100 | 0.034 | 0.011 | 0.006 | 0.061 |
| Violent Crimes | 95 | 341.100 | 686.700 | 6.333 | 5,376.000 |
| Violent Crime Rate | 95 | 0.003 | 0.001 | 0.001 | 0.007 |
| Aged Adjusted Mortality Rate | 100 | 8.265 | 0.974 | 5.368 | 10.500 |
par(mfrow=c(1,2))
hist(nc2016$ageAdjRate, main = "Age Adjusted Mortality Rate", xlab = "Rate", col = "purple")
boxplot(nc2016$ageAdjRate, horizontal = TRUE, main = "Age Adjusted Mortality Rate", xlab = "Rate", col = "purple")
We can see from the above histogram and boxplot that the Age
Adjusted Mortality Rate variable is distributed close to normally and so
it will be fine as it is as a dependent (response) variable.
par(mfrow=c(2,2))
hist(nc2016$mentUnhDays, main = "Mentally Unhealthy Days", xlab = "Days", col = "purple")
hist(nc2016$perInsSleep, main = "Percent Insufficient Sleep", xlab = "Percent with Insufficient Sleep", col = "purple")
boxplot(nc2016$mentUnhDays, horizontal = TRUE, main = "Mentally Unhealthy Days", xlab = "Days", col = "purple")
boxplot(nc2016$perInsSleep, horizontal = TRUE, main = "Percent Insufficient Sleep", xlab = "Percent with Insufficient Sleep", col = "purple")
Both the above variables appear to be normally distributed,
however Mentally Unhealthy Days does have a couple outliers to
consider.
par(mfrow=c(2,2))
hist(nc2016$perAccToExer, main = "Access to Exercise Opportunities", xlab = "Percent with access", col = "purple")
hist(nc2016$perExcDrink, main = "Excessive Drinking", xlab = "Percent excessive drinking", col = "purple")
boxplot(nc2016$perAccToExer, horizontal = TRUE, main = "Access to Exercise Opportunities", xlab = "Percent with access", col = "purple")
boxplot(nc2016$perExcDrink, horizontal = TRUE, main = "Excessive Drinking", xlab = "Percent excessive drinking", col = "purple")
Access to Exercise Opportunities is slightly skewed left while
Excessive Drinking is skewed right with an outlier.
par(mfrow=c(2,2))
hist(nc2016$foodEnvIndx, main = "Food Environment Index", xlab = "Index", col = "purple")
hist(nc2016$tBirthRate, main = "Teen Birth Rate", xlab = "Rate", col = "purple")
boxplot(nc2016$foodEnvIndx, horizontal = TRUE, main = "Food Environment Index", xlab = "Index", col = "purple")
boxplot(nc2016$tBirthRate, horizontal = TRUE, main = "Teen Birth Rate", xlab = "Rate", col = "purple")
Food Environment Index is skewed left and Teen Birth Rate is
mostly normally distributed.
par(mfrow=c(2,2))
hist(nc2016$violenCrime, main = "Violent Crimes", xlab = "Number of violent crimes", col = "purple")
hist(nc2016$vioCrimeRate, main = "Violent Crime Rate", xlab = "Rate", col = "purple")
boxplot(nc2016$violenCrime, horizontal = TRUE, main = "Violent Crimes", xlab = "Number of violent crimes", col = "purple")
boxplot(nc2016$vioCrimeRate, horizontal = TRUE, main = "Violent Crime Rate", xlab = "Rate", col = "purple")
Both of these variables are skewed right but the Violent Crime
Rate is not as strongly skewed. I will include Violent Crime Rate in my
analyses going forward but not Violent Crimes.
par(mfrow=c(2,2))
hist(nc2016$chlamyd, main = "Chlamydia Cases", xlab = "Number of violent crimes", col = "purple")
hist(nc2016$perChlamyd, main = "Chlamydia Rate", xlab = "Percent with STI", col = "purple")
boxplot(nc2016$chlamyd, horizontal = TRUE, main = "Chlamydia Cases", xlab = "Number of violent crimes", col = "purple")
boxplot(nc2016$perChlamyd, horizontal = TRUE, main = "Chlamydia Rate", xlab = "Percent with STI", col = "purple")
Again both of these variables are skewed right but the
Chlamydia Rate is not as strongly skewed. I will include Chlamydia Rate
in my analyses going forward but not Chlamydia Cases.
par(mfrow=c(2,2))
hist(nc2016$Unemploy, main = "Number Unemployed", xlab = "Rate", col = "purple")
hist(nc2016$perUnemploy, main = "Percent Unemployed", xlab = "Rate", col = "purple")
boxplot(nc2016$Unemploy, horizontal = TRUE, main = "Number Unemployed", xlab = "Rate", col = "purple")
boxplot(nc2016$perUnemploy, horizontal = TRUE, main = "Percent Unemployed", xlab = "Rate", col = "purple")
And here too we see that both variables are right skewed but
the Percent Unemployed is less skewed so I will consider only the
Percent Unemployed in my future analyses.
feats <- c("ageAdjRate", "mentUnhDays", "perInsSleep", "perAccToExer", "perExcDrink", "tBirthRate", "vioCrimeRate", "perChlamyd", "perUnemploy")
pairs(nc2016[ ,feats], col = "purple")
Displayed above is a matrix of scatterplots between the
variables in which I am interested. When looking at ageAdjRate we see an
increasing linear trend with a majority of the covariates. ageAdjRate
and perAccToExer appears to have little to no trend, possibly a slight
negative linear trend. And ageAdjRate and perExcDrink has a negative
linear trend. There does not appear to be a strong correlation between
ageAdjRate and the covariates except for tBirthRate which appears to be
moderately correlated.
m_noNA <- nc2016 %>% mutate_if(is.numeric, ~replace(., is.na(.), 0))
feats <- c("ageAdjRate", "mentUnhDays", "perAccToExer", "perExcDrink", "tBirthRate", "vioCrimeRate", "perInsSleep", "perChlamyd", "perUnemploy")
corr <- cor(m_noNA[ , feats])
corrplot(corr, type = "upper", tl.cex = 1.15, tl.col = "black")
In the above correlation plot/heatmap we see that tBirthRate
has the strongest positive correlation with ageAdjRate (~0.6) and
perExcDrink has the strongest negative correlation with ageAdjRate
(~-0.4). We do not see any highly correlated covariates (correlation
> 0.9).
m_noNA <- nc2016 %>% mutate_if(is.numeric, ~replace(., is.na(.), 0))
feats <- c("ageAdjRate", "mentUnhDays", "perAccToExer", "perExcDrink", "tBirthRate", "vioCrimeRate", "perInsSleep", "perChlamyd", "perUnemploy")
corr <- cor(m_noNA[ , feats])
upper <- corr
upper[upper.tri(corr)] <- ""
upper <- as.data.frame(upper)
upper
In the above table we can see the numerical correlation value
of the covariates. We observe that all the values are less than 0.9 and
therefore may conclude that there is no collinearity or
multicollinearity.
ggpairs(nc2016[ ,feats],
upper = list(continuous = wrap(ggally_cor, stars = F, size = 5)),
lower = list(continuous = wrap(ggally_points, col='#005F61')),
diag = list(continuous = wrap(ggally_densityDiag, col = '#6600DB')))
In this view we see the scatterplots for ageAdjRate along the first column with the associated correlation coefficients along the first/top row. Along the diagonal are the density curves of each of the variables which displays the distribution of the variables in the data set.
I will continue my analysis with some statistical modeling and tests in the file 2016Mort_models_tests.html.